Data exploration is not only about creating numbers and summary statistics. Sometimes a good plot can reveal more insights than a whole dataframe filled with numbers (especially to the human eye). In this exercise we make use of what we’ve just learned about plots with ggplot2. This time we are going to use all of the Gapminder GDP data.

1

Load the Gapminder GDP data from the Excel file and convert it to long format (as in the Summary Statistics exercise). You can simply reuse the code from the Summary Statistics exercises for this, but make sure that you do not exclude the data for the years 1970 to 2001.
Remember that we used the filter() function for choosing the individual time periods, so you need to exclude that line from the previous exercises here.
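As a minimal sketch of the reshaping step, the following uses a small made-up table in place of the Excel file. The object names gdp_wide and gdp_long, the column names, and all values are assumptions for illustration only. The real sheet will have many more year columns.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical wide-format stand-in for the Excel sheet: one row per
# country, one column per year (values are made up for illustration).
gdp_wide <- tribble(
  ~country, ~`1960`, ~`1985`, ~`2011`,
  "A",          100,     150,     210,
  "B",           80,     120,     190
)

# Reshape to long format: one row per country-year combination.
# No filter() on the years is applied, so all periods are kept.
gdp_long <- gdp_wide %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "gdp")

gdp_long
```

Note that after pivot_longer() the year column is a character column, because it was built from column names.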

In the previous exercises we only analyzed how the period from 1960 to 1969 compares to the period from 2002 to 2011. The nice thing about plots is that we can make use of the whole range of years and still identify differences between various periods. Our plot of choice, therefore, is a line plot to create a nice time series.

2

Plot the gapminder data as a line plot to display a time series.
Instead of geom_point as in the slides, the name of the geom we need is geom_line. In addition, in the aesthetics definition aes() you should define a grouping variable group = 1. Otherwise, ggplot assumes that you want to plot one line for each year.
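The hint above could look roughly like this. The data frame is a made-up stand-in with one GDP value per year. The names gdp_long, year, and gdp are assumptions, so swap in your own long-format data and column names.

```r
library(ggplot2)

# Hypothetical long-format data: one GDP value per year, with the year
# stored as a character column (as it would be after pivoting).
gdp_long <- data.frame(
  year = as.character(1960:2011),
  gdp  = seq(100, 610, by = 10)
)

# group = 1 tells ggplot2 to connect all observations into one line
# instead of treating every (discrete) year as its own group.
p_line <- ggplot(gdp_long, aes(x = year, y = gdp, group = 1)) +
  geom_line()

p_line
```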

Admittedly, this may not be the best approach to identify differences between the periods directly: the plot does not show where our periods start and end. Luckily, this can be fixed in at least two different ways. Let’s start with the first one: using colors for different periods. For this purpose, we need an indicator variable as a grouping variable, so that we can use a different color for the line in each period.

3

Create an indicator variable for the time periods 1960-1969, 2002-2011 and the time in between.
A combination of mutate() and if_else() lets you create the new variable we need. To get some sensible legend labels later, you should specify the values of the indicator variable as strings.
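One possible way to build such an indicator is sketched below with nested if_else() calls. The data are made up, and the names gdp_periods, year_num, and period are assumptions, not names from the course material.

```r
library(dplyr)

# Hypothetical long-format data with the year stored as character.
gdp_long <- data.frame(
  year = as.character(1960:2011),
  gdp  = seq(100, 610, by = 10)
)

gdp_periods <- gdp_long %>%
  mutate(
    # Convert the year to a number so it can be compared with <= and >=.
    year_num = as.integer(year),
    # String labels double as legend labels in later plots.
    period = if_else(
      year_num <= 1969, "1960-1969",
      if_else(year_num >= 2002, "2002-2011", "1970-2001")
    )
  )

count(gdp_periods, period)
```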

After we’re set up with our indicator variable, it’s plotting time again. We can simply reuse our code from before and define a grouping color in the aesthetics definition.

4

Plot the line plot once again, but this time with different colors for the different time periods.
In the aesthetics definition aes(), you can choose the option color = indicator_variable to define the grouping.
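A sketch of the colored line plot, again on made-up data. The indicator column is called period here, which is an assumed name. Keeping group = 1 alongside the color mapping is a deliberate choice: the line stays connected while its color changes from period to period.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data with a string indicator for the three periods.
gdp_periods <- data.frame(
  year = as.character(1960:2011),
  gdp  = seq(100, 610, by = 10)
) %>%
  mutate(
    year_num = as.integer(year),
    period = if_else(
      year_num <= 1969, "1960-1969",
      if_else(year_num >= 2002, "2002-2011", "1970-2001")
    )
  )

# One connected line (group = 1) whose color follows the indicator.
p_color <- ggplot(gdp_periods,
                  aes(x = year, y = gdp, group = 1, color = period)) +
  geom_line()

p_color
```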

Now we can see some visual differences between the different periods. One last thing, however, is that there are way too many labels on the x-axis. A more sensible approach to axis labeling would be to create an axis break every ten years. However, this is an advanced exercise, as we did not talk about manipulating axes before. If you’re not feeling adventurous, just jump to the next exercise, which is also optional.

5 (advanced)

Create some prettier, i.e., more sensible breaks for the x-axis.
You can modify the x-axis with scale_x_discrete() and its breaks with the option breaks = breaks_vector.
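A minimal sketch of decade breaks, assuming the year is a character column so that the x-axis is discrete. The data and the names gdp_long and breaks_vector are illustrative assumptions. Because the axis values are strings, the breaks have to be strings as well.

```r
library(ggplot2)

# Hypothetical long-format data with the year stored as character.
gdp_long <- data.frame(
  year = as.character(1960:2011),
  gdp  = seq(100, 610, by = 10)
)

# One break per decade, converted to character to match the axis values.
breaks_vector <- as.character(seq(1960, 2010, by = 10))

p_breaks <- ggplot(gdp_long, aes(x = year, y = gdp, group = 1)) +
  geom_line() +
  scale_x_discrete(breaks = breaks_vector)

p_breaks
```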

Thus far, we have only looked at how a single variable develops over time. But the power of data visualization lies in revealing multivariate relationships. Have a look at the following code:

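library(dplyr)     # provides filter() and the %>% pipe
library(ggplot2)   # provides ggplot(), aes(), and the geoms
library(gapminder) # provides the gapminder data set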

gapminder %>% 
  filter(??? == ???) %>% 
  ggplot(aes(x = ???,
             y = ???,
             size = ???,
             color = ???)) +
  geom_???()

6 (optional)

Please fill in the missing parts marked by ???. Say we aim to

  1. plot data for the year 2007,

  2. show the relationship between GDP per capita and life expectancy,

  3. adjust the visualization by population size,

  4. maybe use different colors for the continents,

  5. and choose a proper geom to plot everything.
Think about the kind of plot in which differently sized geoms actually make sense.
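If you want to compare your result afterwards, one possible completion is sketched below. It assumes the column names of the gapminder package data set (year, gdpPercap, lifeExp, pop, continent) and uses points as the geom. The object names gapminder_2007 and p_bubble are made up for this sketch.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Keep only the observations for 2007.
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# Points sized by population and colored by continent.
p_bubble <- ggplot(gapminder_2007,
                   aes(x = gdpPercap,
                       y = lifeExp,
                       size = pop,
                       color = continent)) +
  geom_point()

p_bubble
```

Points are a natural fit here because their area can vary from observation to observation, which is what the size aesthetic needs.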

There’s always more. You may have noticed that the scaling of the x-axis is a little bit odd: the relationship we see is non-linear. ggplot2 has built-in functions that make it really easy to rescale axes.

7 (optional)

Transform the x-axis to a log10 scale.
Have a look at the function scale_x_log10().
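A sketch of the rescaled plot, assuming the 2007 bubble plot from the previous exercise was built on the gapminder package data with gdpPercap on the x-axis. The object name p_log is made up for this sketch.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Same 2007 scatter plot as before, with the x-axis on a log10 scale.
p_log <- gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap,
             y = lifeExp,
             size = pop,
             color = continent)) +
  geom_point() +
  scale_x_log10()

p_log
```

On the log10 scale, equal distances on the x-axis correspond to equal multiplicative changes in GDP per capita, which tends to make the relationship with life expectancy look closer to linear.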